Load Necessary Packages

library(tidyverse)
library(tidymodels)
library(naniar)
library(skimr)
library(kableExtra)
library(vip)

Introduction

The arts in all their forms, from live jazz and opera to plays, classical symphonies, musicals, and dance performances, are a ubiquitous facet of modern life. Museums, theaters, small music venues, and symphony halls are present in cities across the country and the world. They are often integral to how we define both our residential spaces and our culture, because of the ways in which participation in and exposure to the arts bring joy, meaning, and community into our lives. A pressing subject that we, as data scientists, lovers of the arts, and American citizens, wish to investigate is the link between participation in the arts in America and various demographic and social factors, particularly income. That is why, for our predictive modeling project, we decided to use a dataset from the 1992 Local Area Arts Participation study. This study was funded by the National Endowment for the Arts (Research Division) and conducted by Abt Associates of Cambridge, MA. It contains information on whether and how often Americans attended and participated in various kinds of local art, including jazz concerts, ballet performances, and more, along with various demographic characteristics of the respondents. In collecting the data, Abt Associates randomly selected 5,040 Americans from 12 localities (Broward County, Chicago, Dade County, Las Vegas, Pittsburgh, Reno, Rural Nevada, San Jose, Seattle, Sedona, Winston-Salem) via a random-digit-dialing sampling methodology and contacted them by phone for this survey.

Thus, we plan to use the various arts participation variables in this dataset, as well as demographic information like race, gender, and age, to predict the household income of the respondents in this survey. Hopefully, through the process of predicting respondent income, we can better understand who attends arts performances and exhibits, how often they do, and why other people don’t, in order to eventually make the arts more accessible and appealing to everyone.

Data Overview and Missingness

The data from the Local Area Arts Participation study was downloaded from the National Archive of Data on Arts and Culture (see appendix for citation). It required some minor cleaning before it was ready for use in this project. This cleaning involved first converting the data into a tibble format. We also changed values from NA to zero in certain variables where, given the context, this made more sense. (For example, participants were asked whether they had attended a jazz concert in the last 12 months, and if so, how many times. In the original dataset, respondents who said they had not attended a jazz concert in the last 12 months had a value of NA for the follow-up variable recording how many times they attended. This was the case for all similarly formatted pairs of variables. We converted these NA values to 0.) Further, in all categorical variables, we converted the levels Don't know and Refused to NA. For numeric variables, missing values were coded as numbers (98 and 99), so we recoded those values as NA in order to read the data correctly. We also made a few other minor adjustments, such as dropping a few variables that we knew from the start would be irrelevant to our project, re-coding the names of the levels of income (our outcome variable) for easier readability, and cleaning the variable names into snake case.
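A minimal base-R sketch of the two recoding steps described above, run on an invented two-row extract (the column names mirror the survey variables jazz, njazz, and age, but the values and level labels here are assumptions, not the survey's actual coding):

```r
# Invented two-row extract; column names mirror the survey variables
raw <- data.frame(
  jazz  = c("Yes", "No"),  # attended a jazz concert in the last 12 months?
  njazz = c(3, NA),        # how many times (NA when jazz == "No")
  age   = c(34, 99),       # 98/99 were numeric codes for missing values
  stringsAsFactors = FALSE
)

# Skip-pattern NAs become true zeros: answering "No" means 0 attendances
raw$njazz[raw$jazz == "No" & is.na(raw$njazz)] <- 0

# Numeric missing-value codes (98, 99) become genuine NAs
raw$age[raw$age %in% c(98, 99)] <- NA
```

The same pattern was applied across every attended/how-many-times pair of variables in the dataset.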

After cleaning, we saved our processed dataset into an .rda file, which is read in below. The data is named local_arts_data. There are 5,040 total observations and 88 attributes: 65 are factor (categorical) variables and the other 23 are numeric. The categorical variables generally record whether the respondent attended or participated in local art, as well as demographic data. The numeric features include the number of times the respondent attended a specific event as well as numeric demographic data.

# Load the Data
load(file = "data/processed/local_arts_data.Rda")

After reading in the data, we performed an exploration of missingness. The results in the form of a table and a graph are shown below.

# Create function to select columns with missing values
has_na <- function(x){
  any(is.na(x))
}

# table of missingness
local_arts_data %>%
  # select only columns that have missing values
  select_if(has_na) %>%
  # summary of missing variables (from naniar)
  miss_var_summary() %>% 
  print(n = Inf)
## # A tibble: 80 x 3
##    variable n_miss pct_miss
##    <chr>     <int>    <dbl>
##  1 bar5       5029  99.8   
##  2 bar4       4946  98.1   
##  3 mags       4913  97.5   
##  4 more8      4823  95.7   
##  5 bar3       4612  91.5   
##  6 more7      4385  87.0   
##  7 radio      3943  78.2   
##  8 more6      3738  74.2   
##  9 tvtype     3642  72.3   
## 10 bar2       3590  71.2   
## 11 more5      2969  58.9   
## 12 mostimp    2370  47.0   
## 13 more4      2229  44.2   
## 14 more3      1570  31.2   
## 15 bar1       1483  29.4   
## 16 newsp      1432  28.4   
## 17 more1      1124  22.3   
## 18 moremost   1124  22.3   
## 19 more2      1009  20.0   
## 20 over18      940  18.7   
## 21 income      915  18.2   
## 22 source1     361   7.16  
## 23 rateinfo    322   6.39  
## 24 age         226   4.48  
## 25 ntvart      175   3.47  
## 26 ntvjazz     143   2.84  
## 27 ntvclass    137   2.72  
## 28 ntvdance    133   2.64  
## 29 ntvplay     128   2.54  
## 30 race        118   2.34  
## 31 ntvmus      116   2.30  
## 32 educ         96   1.90  
## 33 marital      95   1.88  
## 34 nbooks       84   1.67  
## 35 hhsize       73   1.45  
## 36 gomore       67   1.33  
## 37 npark        66   1.31  
## 38 tvart        64   1.27  
## 39 tvplay       56   1.11  
## 40 ntvopera     53   1.05  
## 41 tvmus        47   0.933 
## 42 nmuseum      38   0.754 
## 43 tvjazz       37   0.734 
## 44 schools      34   0.675 
## 45 nplay        32   0.635 
## 46 park         29   0.575 
## 47 tvdance      27   0.536 
## 48 howimp       27   0.536 
## 49 tvclass      25   0.496 
## 50 nfair        22   0.437 
## 51 play         21   0.417 
## 52 nmusical     20   0.397 
## 53 lisjazz      18   0.357 
## 54 lisplay      18   0.357 
## 55 nclassic     15   0.298 
## 56 musical      15   0.298 
## 57 njazz        14   0.278 
## 58 hearpoet     13   0.258 
## 59 museum       12   0.238 
## 60 lismus       12   0.238 
## 61 jazz          9   0.179 
## 62 fair          8   0.159 
## 63 tvopera       8   0.159 
## 64 lisclass      8   0.159 
## 65 lisopera      8   0.159 
## 66 cinema        8   0.159 
## 67 books         7   0.139 
## 68 classic       6   0.119 
## 69 nopera        5   0.0992
## 70 readplay      5   0.0992
## 71 readpoet      5   0.0992
## 72 opera         4   0.0794
## 73 nodance       4   0.0794
## 74 readnov       4   0.0794
## 75 hearnov       4   0.0794
## 76 nballet       3   0.0595
## 77 caseid        2   0.0397
## 78 odance        2   0.0397
## 79 ballet        1   0.0198
## 80 weight        1   0.0198
# graph missingness
local_arts_data %>% 
  # select only columns that have missing values
  select_if(has_na) %>% 
  # create missingness graph (from naniar)
  gg_miss_var()

80 of the 88 variables in our dataset contain missingness, some a considerable amount. As the table shows, over 20% of the values of the variables from bar5 down to moremost are NA; that's 18 variables with more than 20% of their values missing. The majority of the variables with missingness, however, are missing less than 20% of their values. We postulated three reasons for this missingness:

  • The respondent did not know the answer.
  • The question did not apply to this respondent.
  • The respondent refused to answer the question.

The majority of NA values are due to the second reason: the question did not apply to the respondent. For instance, the columns with the most missing values, bar5 and bar4, ask about the fourth and fifth reasons why the respondent did not attend cultural and artistic events more often. However, the previous three questions, bar1, bar2, and bar3, ask the same thing, and it’s likely that most respondents didn’t have more than three reasons for not attending more cultural/artistic events. Thus, bar5 and bar4 did not apply to those respondents, given that their reasons were already covered in their responses to bar1, bar2, and bar3. Another example is mags, which asks what magazines the respondent read to learn about art events in the community. The response a person gives to this question depends on whether the respondent read magazines and cited them as a source for learning about art events in the first place (i.e., their responses to the source variables, which ask where a respondent learned about arts events in their community). Thus, if people didn’t learn about arts events through magazines (which clearly few people did, given how many missing values this variable has), they would not respond to this question and their value for this variable would be NA. For those predictors with missing values that do not fall under the category of the question not applying to the respondent, it is more likely that the respondent simply did not know the answer to the question (or how to respond to it) than that they refused to answer.

For the variables with more than 20% missingness, we decided to remove most of them, as imputing such a large proportion of the data would make the data less reflective of the population. We decided to remove 14 predictors with a significant amount of missingness: bar5, bar4, mags, more8, bar3, more7, radio, more6, tvtype, bar2, more5, more4, more3, newsp. The code for that is below.

local_arts_data <- local_arts_data %>% 
  select(-bar5, -bar4, -mags, -more8, -bar3, -more7, -radio, -more6, 
         -tvtype, -bar2, -more5, -more4, -more3, -newsp)

We decided to keep a few variables with greater than 20% missingness, most notably mostimp, as that question records the most important reason the participant did not attend arts events more often.

For the rest of the variables with missingness at or below 20%, an imputation step will be used to replace those missing values in our model recipes.

Outcome Variable

In our predictive modeling project, we will attempt to predict the survey respondent’s household income, the variable income, using as predictors the other variables in the dataset, which relate to arts attendance and demographic information. income is a categorical variable with 9 levels indicating the range of the respondent’s household income. Before conducting an initial split of the data into a training and testing set, we wanted to visualize the outcome variable to understand its distribution and check for problems.

# Distribution of income (graph)
local_arts_data %>%
  ggplot(aes(x = income)) +
  geom_bar() +
  coord_flip() +
  labs(
    title = "Distribution of Respondent's Household Income"
  ) +
  theme_minimal()

# Table of the levels of income
local_arts_data %>% 
  count(income)
## # A tibble: 10 x 2
##    income                 n
##    <fct>              <int>
##  1 Under $10,000        315
##  2 $10,000 to $14,999   288
##  3 $15,000 to $19,999   375
##  4 $20,000 to $29,999   721
##  5 $30,000 to $39,999   679
##  6 $40,000 to $49,999   540
##  7 $50,000 to $74,999   698
##  8 $75,000 to $99,999   243
##  9 $100,000 or more     266
## 10 <NA>                 915

As the graph and table above show, there is a significant amount of missingness in the outcome variable, with a total of 915 missing values. Most of this missingness is because participants didn’t know their household income or refused to answer the question, with some NA values being truly missing responses. For the rest of the data, the income categories with the highest counts are $20,000 to $29,999, $50,000 to $74,999, and $30,000 to $39,999. The categories with the lowest counts are $75,000 to $99,999 and $100,000 or more. Before splitting the data into a training and test set, it makes sense to drop the rows that are missing income, since it would not make sense (nor would it be possible) to predict missing values. The code for that is below.

# Only keep observations with values for income
local_arts_data <- local_arts_data %>% 
  filter(!is.na(income))

Splitting and Cross Validation

We decided to initially split the data into 70% training and 30% testing. The imbalance between the various classes of income shown above makes it important to stratify our initial split (and later our cross-validation) by income. The training and testing sets are read in below. Code for how this was actually conducted can be found in the model_set_up.R script inside the “models” folder inside the “output” folder. The training data will be used to build our candidate models, and the testing data will be used as “new data” to test the final performance of our best model after tuning.

# load training data
load(file = "data/processed/local_arts_train.rda")

# load testing data
load(file = "data/processed/local_arts_test.rda")

After conducting an initial split, the training data was also folded using 5-fold cross-validation with 3 repeats. These folds were used later during the tuning process to collect reliable information on model performance and avoid overfitting. Folding the training data allows us to more reliably assess the performance of our models, i.e., decrease the bias of our results, before applying them to the testing data, because we are not assessing each model on the entire training data but rather on multiple small subsets. When we fold the training data, we split it into, in this case, 5 groups. For each fold, four groups are kept as the analysis set, used to build and train the model, and one group is left out as the assessment set, used to assess the performance of the model. A different group is left out as the assessment set each time to ensure that the model being tuned is evaluated across different subsets of the training data. Doing repeated cross-validation then simply means that this folding process is repeated, in this case two more times, with the data shuffled differently into the 5 groups. Repeating the folding process reduces the variance in the performance values collected from the assessment sets of the folds. The folded training data is read in below. Code for how this was actually conducted can be found in the model_set_up.R script inside the “models” folder inside the “output” folder.

# load folded object
load(file = "data/processed/local_arts_fold.rda")
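The authoritative folding code lives in model_set_up.R; as a hedged stand-in, the call below shows how rsample's vfold_cv produces a stratified, repeated fold object of this shape, with iris standing in for local_arts_train and Species standing in for income:

```r
library(rsample)

set.seed(123)
# 5 folds x 3 repeats; strata keeps the class balance similar across folds
folds <- vfold_cv(iris, v = 5, repeats = 3, strata = Species)

nrow(folds)  # one row per resample: 5 folds for each of 3 repeats = 15
```

Each row of the resulting tibble holds one analysis/assessment split, which is what the tuning functions iterate over.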

Exploratory Data Analysis

Exploratory data analysis is a vital step in the modeling process. This is where relationships between the outcome and predictor variables are analyzed, and variables are examined for any problematic qualities that will need to be addressed in a model recipe.

After removing variables with a significant amount of missingness, there are 74 variables remaining, 73 of which may serve as predictor variables.

Frequency of Arts Attendance Predictor Variables

To begin our exploratory data analysis, we thought it would make sense to explore the relationship between income and the various variables that deal with attendance at/engagement with the arts. There are two sets of variables indicating participation in the arts. One set is categorical, indicating whether or not the participant has gone to the indicated arts event in the last 12 months. The other set is numeric, indicating how many times the participant has gone to the indicated arts event in the last 12 months. First, let’s look at how income is related to some of the categorical arts variables, i.e., whether or not a person has attended the specified arts event in the last 12 months.

# combined graph of categorical arts participation variables
local_arts_train %>% 
  select(income, jazz, classic, opera, musical, play, ballet, museum, fair, odance) %>%
  pivot_longer(-income, names_to = "var", values_to = "value") %>%
  ggplot(aes(x = income, fill = value)) +
  geom_bar(position = "fill") +
  coord_flip() +
  facet_wrap(~ var) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "top") +
  guides(fill = guide_legend(title.position = "top")) +
  labs(
    fill = "Have you attended (*) in the last 12 months?",
    y = "Proportion of Participants",
    x  = "Income"
  )

The faceted graph above shows that for most arts activities, the proportion of people who attend them increases as income bracket increases. This pattern is particularly defined for attendance at museums and musicals. Only around 30% of those with incomes under 10,000 dollars attended museums within the year, while around 70% of those with incomes of 100,000 dollars or more did. Only about 15% of people with incomes under 10,000 dollars attended musicals within the year, while around 50% of those with incomes of 100,000 dollars or more did. This pattern makes sense, given that arts events cost money (and sometimes quite a bit) to attend. Thus, people with more financial ability and time are more likely to attend arts events.

Another categorical arts participation variable that we thought would be important to look at in relation to income was whether or not a person went to the movies in the past twelve months.

local_arts_train %>% 
  ggplot(aes(income, fill = cinema)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Going to the movies by income",
       fill = "Have you gone to movie theater\nto see a movie\nin last 12 months?",
       y = "prop")

The graph pretty clearly shows that people with higher incomes went to see a movie in the theaters more frequently in the last twelve months than people with lower incomes. cinema thus proves to be another important predictor variable to include along with the other categorical predictor variables related to arts participation shown above.

Next, let’s look at income in relation to the variables that indicate the number of times people attended the same arts events within the year.

# combined graph of numeric arts participation variables
local_arts_train %>%
  select(income, starts_with("n")) %>%
  select(., 1:10) %>%
  pivot_longer(-income, names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 10)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Attendance at Arts Activities by Income",
    x = "number of times attending"
  )

It’s much harder to see a pattern with the numeric arts variables because so many people did not attend the above types of arts events at all within the year. However, there are a few variables in which a pattern with income is more clear. Again, there seems to be a strong relationship between a person’s household income and the number of times they attended museums within the year, with higher-income respondents attending museums more frequently. It also seems that people in the top few income brackets attended jazz and classical concerts, as well as musicals and plays, more frequently than those with less money.

Below is another graph of income in relation to more variables indicating the number of times the respondent participated in a specific activity in the last 12 months:

  • nbooks: the number of books the respondent read
  • npark: the number of times the respondent visited a historic park/monument
  • ntvart: the number of times the respondent watched a visual arts program on TV/VCR
  • ntvclass: the number of times the respondent watched a classical music program on TV/VCR
  • ntvdance: the number of times the respondent watched a dance program on TV
  • ntvjazz: the number of times the respondent watched a jazz program on TV
  • ntvmus: the number of times the respondent watched a musical on TV
  • ntvopera: the number of times the respondent watched opera on TV
  • ntvplay: the number of times the respondent watched a play on TV

# combined graph of numeric arts participation variables 2
local_arts_train %>%
  select(income, starts_with("n")) %>%
  select(., 1, 11:19) %>%
  pivot_longer(-income, names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 20)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Engaging in/Watching Arts Activities by Income",
    x = "number of times engaging"
  )

Again, for most of these arts activities, there doesn’t seem to be a noticeable pattern between the number of times the participant engaged with that activity and their household income. nbooks, however, is an exception. It’s very clear that participants with higher household incomes read more books within the year. There also seems to be a slight relationship between npark and income, with people with higher yearly incomes visiting historic parks more frequently than those with lower yearly incomes.

It would also make sense to look at the distributions of these numeric variables related to attendance at, and engagement with, arts activities. Analyzing the distribution of numeric variables in one’s dataset is an important part of the pre-modeling EDA process because many models are very sensitive to skewness, which must therefore be identified and dealt with before the data can be fed to the models. Below, we selected just these numeric arts variables and conducted a skim with charts.

local_arts_train %>% 
  select(starts_with("n")) %>% 
  skim()
Data summary
Name Piped data
Number of rows 2886
Number of columns 18
_______________________
Column type frequency:
numeric 18
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
njazz 8 1.00 0.76 4.31 0 0 0 0 97 ▇▁▁▁▁
nclassic 7 1.00 0.68 3.80 0 0 0 0 97 ▇▁▁▁▁
nopera 2 1.00 0.10 0.48 0 0 0 0 8 ▇▁▁▁▁
nmusical 13 1.00 0.65 1.61 0 0 0 1 30 ▇▁▁▁▁
nplay 13 1.00 0.58 2.02 0 0 0 0 50 ▇▁▁▁▁
nballet 2 1.00 0.15 0.80 0 0 0 0 25 ▇▁▁▁▁
nodance 4 1.00 0.35 2.78 0 0 0 0 97 ▇▁▁▁▁
nmuseum 17 0.99 2.43 8.34 0 0 0 2 97 ▇▁▁▁▁
nfair 11 1.00 1.65 4.04 0 0 1 2 97 ▇▁▁▁▁
npark 34 0.99 2.72 9.36 0 0 0 2 97 ▇▁▁▁▁
nbooks 35 0.99 14.00 22.83 0 1 5 15 97 ▇▁▁▁▁
ntvjazz 71 0.98 1.93 7.02 0 0 0 2 97 ▇▁▁▁▁
ntvclass 61 0.98 2.10 7.12 0 0 0 2 97 ▇▁▁▁▁
ntvopera 21 0.99 0.60 2.80 0 0 0 0 97 ▇▁▁▁▁
ntvmus 49 0.98 1.02 4.58 0 0 0 1 97 ▇▁▁▁▁
ntvplay 64 0.98 1.19 5.43 0 0 0 1 97 ▇▁▁▁▁
ntvdance 57 0.98 1.58 6.15 0 0 0 2 97 ▇▁▁▁▁
ntvart 89 0.97 3.07 9.86 0 0 0 3 97 ▇▁▁▁▁

Every variable is very clearly quite right-skewed. This is because each variable has the bulk of its data around zero, a mean between zero and 15, and data trailing off to some very high outliers. Furthermore, the small histograms shown in the skim display an asymmetric distribution for each of the selected variables. This skewness, as mentioned above, will need to be addressed in the modeling recipes.
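As a small illustration of how this skewness gets handled later, a Yeo-Johnson step from the recipes package can be prepped on a toy right-skewed variable (the variable here is simulated, not from the survey):

```r
library(recipes)

set.seed(1)
toy <- data.frame(x = rexp(500))  # strongly right-skewed toy variable

# Predictor-only recipe that learns a Yeo-Johnson transformation for x
rec <- recipe(~ x, data = toy) %>%
  step_YeoJohnson(x) %>%
  prep()

transformed <- bake(rec, new_data = NULL)
# transformed$x is much more symmetric than the original toy$x
```

The prepped step stores the estimated lambda, so the same transformation is applied consistently to new data at predict time.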

Demographic Predictor Variables

Another set of variables we thought would be important to explore in this data analysis are those relating to the demographics of the respondent, such as age, gender, race, education level, and marital status. Each of these demographic variables seems as if it could provide valuable insight into the income of the participant. For example, income generally increases when a person has more education. There is also usually a correlation between race and income, with white households having higher household incomes than households of color.

We began by looking at the relationship between income and the education level of the participant.

local_arts_train %>%
  select(income, educ, race, gender, age, hhsize) %>%
  ggplot(aes(income, fill = educ)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Education Levels across Income Brackets",
    fill = "Highest Education Level",
    y = "prop"
  )

As hypothesized, there appears to be a correlation between education level and income. The proportion of people with bachelor’s degrees and with grad degrees increases as income increases.

Since there are multiple levels of educ, it would make sense to visualize the distribution of this predictor.

# distribution of educ
local_arts_train %>% 
  mutate(
    educ = educ %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(educ)) +
  geom_bar() +
  coord_flip() +
  labs(
    x = "Highest Education Level"
  ) +
  theme_minimal()

The levels with the highest counts are Some college, High school, and Bachelors degree. There are a few levels, Some grad. school, Vocational school, Grades k-8, No school, and Other, that have very few values and could pose a problem for modeling due to dummy coding. This will be addressed by the addition of a step_other step in our recipe that combines infrequently occurring levels into a single other level.

The next demographic factor we explored in relation to income is gender.

local_arts_train %>%
  select(income, educ, gender, age, hhsize) %>%
  ggplot(aes(income, fill = gender)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Gender across Income Brackets",
    fill = "Gender",
    y = "prop"
  )

As the income bracket increases, the proportion of women in that bracket decreases, and the proportion of men increases. This is unsurprising, given the persistence of sexism in the workforce, particularly the gender pay gap and the glass ceiling in many top-tier, high-paying jobs.

Next, we explored the relationship between household income and age.

ggplot(local_arts_train, aes(age)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  facet_wrap(~ income) +
  labs(
    title = "Distribution of Age across Income Brackets",
    x = "Age",
    y = "Count"
  )

As the income bracket increases, the distribution of ages seems to shift slightly upward; that is, as income bracket increases, so does the average age of participants. This would make sense: younger adults tend to have less money due to their lack of experience, while older adults have more experience in the workforce and may be more financially responsible, leading to higher incomes.

We also explored the relationship between a respondent’s marital status and their income.

local_arts_train %>%
  ggplot(aes(income, fill = marital)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Marital Status across Income Brackets",
    y = "prop",
    fill = "Marital Status"
  )

The graph shows that there is very clearly a relationship between a respondent’s marital status and their income. There is a higher proportion of married people in the higher income brackets, and a higher proportion of single people in the lower income brackets.

Finally, the last demographic variable we explored was race. First, we looked at the relationship between a participant’s race and their income.

local_arts_train %>%
  ggplot(aes(income, fill = race)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Race by Income",
    fill = "Count",
    y = "prop"
  )

The graph very clearly shows that as the income bracket increases, so does the proportion of white respondents in that bracket. A larger proportion of people of color are represented in the lower income brackets than in the higher income brackets.

We also thought, given the number of levels of the variable race, that it would make sense to look at the distribution of race itself.

local_arts_train %>% 
  mutate(
    race = race %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(race)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

White folks vastly outnumber folks of other races in this dataset. There is very little diversity in this variable, which will be a problem when dummy encoding. This will be addressed by the addition of a step_other step in our recipe that combines infrequently occurring levels into a single other level.

Miscellaneous Findings

There were a few final things that we explored in our dataset prior to modeling. Given how many people responded that they had not attended or participated in a given arts event within the past year, we thought it would be interesting to look at some of the reasons why this is the case. Below is the distribution of the variable bar1, which indicates the respondent’s first reason for not attending local arts events/not attending them more often.

local_arts_train %>% 
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(bar1)) +
  geom_bar() +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

As you can see above, the most common reason for people not attending local arts events more frequently is that they don’t have time, followed by the cost of tickets and lack of childcare. Note also that most of the levels of bar1, except for (02) Don't have time, have very few values. This could be a problem when dummy encoding for modeling, as it will create variables with little to no variance. This problem will be dealt with via the addition of a step_zv step in our recipes to remove zero-variance variables. It could also be interesting to see how these reasons for not attending local arts events change (or remain the same) across levels of income.

# distribution of bar1 by income
local_arts_train %>% 
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(bar1)) +
  geom_bar() +
  facet_wrap(~ income) +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

It seems that across all income levels, lack of time is the primary reason for not attending local arts events more frequently.

Finally, using the caret package, we used nearZeroVar() to identify which variables truly had little to no variance. Such variables would fail to provide meaningful insight or significantly affect the model. These variables were caseid, site, abtid, and exchange, which were all meaningless administrative variables, as well as source2, source3, source4, and source5, which record the 2nd, 3rd, 4th, and 5th answers given by respondents as to their source of information about local arts events. All eight of these variables were removed from our initial dataset before splitting.
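A quick sketch of how nearZeroVar flags such variables, using an invented data frame (the project ran it on local_arts_data; the thresholds in play here are caret's defaults):

```r
library(caret)

set.seed(42)
# Invented frame: 'constant' never varies; 'almost' barely does
toy <- data.frame(
  constant = rep(1, 100),
  almost   = c(rep(0, 99), 1),
  varied   = rnorm(100)
)

# Returns the names of (near-)zero-variance columns
nearZeroVar(toy, names = TRUE)
```

By default a column is flagged when its most common value dominates (frequency ratio above 95/5) and it has few unique values (under 10% of rows), or when it has zero variance outright.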


Ultimately, after removing these last eight variables, we decided to use all 65 remaining predictor variables in our dataset. Although we didn’t explore every single variable in this dataset in our exploratory data analysis, from what we did explore there seems to be at least a mild relationship between most of the possible predictor variables and our outcome, income. The next step after the exploratory data analysis, then, was to build pre-processing recipes to prepare our data for modeling, and then to choose the types of models we wanted to tune.

Recipes

After exploring the data, an important next step in the modeling process is preparing the data to be fed to the models. No model can perform particularly well if the data is left in its raw condition, so feature engineering is used to perform various transformations that make the data as amenable to the models as possible. This can be done with a recipe using the recipes package from tidymodels. We used the exact same recipe for four of our models: the k-nearest neighbors, neural net, elastic net, and support vector machine.

This recipe begins with a step establishing the recipe (via the recipe function), which designates income as our outcome and all other variables as predictors. Next comes an imputation step that fills in missingness using bagged-tree models via step_impute_bag. Then we added step_YeoJohnson for all numeric variables; this transformation helps deal with the skewness found in our exploratory data analysis by making numeric predictors more symmetric. Next, we added two step_other steps, one for educ and one for race, which both had levels with very few values. These steps indicate that, for both educ and race, the bottom 1% of levels should be combined into a single other level. Then we added step_dummy to dummy-code all categorical variables. Dummy coding means that every categorical variable with n levels becomes n - 1 numerically encoded variables with values of 0 (if the observation does not take that level) or 1 (if it does). We made sure to exclude our outcome variable from this dummy coding step even though it too is categorical: models will not run if the outcome variable is dummy coded, so it needs to remain as it is. Next, we added step_normalize to center and scale all predictors. Centering places the mean at zero, and scaling standardizes the units of the data so that all variables have a standard deviation of 1.

Scaling in particular benefits nearest neighbors models, which use distance metrics to calculate the k nearest points to a new observation and thus benefit from all variables being in the same units. Similarly, elastic net models benefit from all variables being in the same units, ensuring the penalty used by the model affects all the variables in the same way. Finally, step_zv was included to remove any zero-variance predictors after dummy coding. Zero-variance predictors result from levels of a categorical variable that occur so infrequently that they simply aren’t present in some re-samples; the dummy-encoded variables representing those levels then contain only zeros. These variables hold no predictive information and can cause problems for our models, which is why it is important to remove them. The code for this recipe is printed below.

# recipe used for knn, neural net, elastic net and support vector machine
recipe1 <- 
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>% 
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>% 
  step_other(race, threshold = 0.01) %>% 
  # dummy encode categorical predictors
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  # normalize all predictors
  step_normalize(all_predictors()) %>% 
  # remove zero-variance predictors
  step_zv(all_predictors())

The recipes for the last two models, the tree-based models, were nearly identical to the one above, except for how we encoded categorical variables. For the boosted tree model, we decided to one-hot encode the categorical variables instead of dummy encoding them. Rather than converting a categorical variable with n levels into n - 1 numeric variables, one-hot encoding converts every level of a categorical variable into its own numeric variable. We did this because, unlike models with linear dependencies, tree-based models can handle having all levels of a categorical variable represented as separate variables, and we thought it would be interesting to see how well the model performs with this kind of encoding. That recipe is below.

# recipe used for boosted tree model
bt_recipe <-
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>%
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>%
  step_other(race, threshold = 0.01) %>%
  # one-hot encode categorical predictors
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  # normalize all predictors
  step_normalize(all_predictors()) %>%
  # remove zero-variance predictors
  step_zv(all_predictors())
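To make the difference between the two encodings concrete, here is a small illustration on a hypothetical toy factor (invented for demonstration, not part of our data) showing how step_dummy produces n - 1 columns by default and n columns with one_hot = TRUE:

```r
library(recipes)

# toy data: one factor with three levels
toy <- tibble::tibble(color = factor(c("red", "green", "blue")))

# dummy coding: 3 levels become n - 1 = 2 indicator columns
recipe(~ ., data = toy) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)

# one-hot encoding: 3 levels become n = 3 indicator columns, one per level
recipe(~ ., data = toy) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
```

With dummy coding, the omitted reference level is implied whenever all the other indicators are zero, which is what avoids linear dependence in models like the elastic net.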

Finally, for the random forest model, we did not dummy encode the categorical variables at all. Random forest models are among the model types that can handle categorical variables without converting them to numeric form. Although leaving categorical variables as they are is not guaranteed to improve the model's performance, it can help the model run faster, which is why we decided not to dummy code them for the random forest. The recipe used for the random forest model is below.

# recipe used for random forest model
rf_recipe <-
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>% 
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>% 
  step_other(race, threshold = 0.01) %>% 
  # normalize all numeric predictors
  step_normalize(all_numeric()) %>% 
  # remove zero-variance predictors
  step_zv(all_predictors())

Models

We created six models. For each model, we selected certain parameters to tune in order to identify the values that would lead to the best performance. The models and the parameters tuned are listed below. (Note: each model was tuned in a separate R script. These scripts can be found in the models folder located inside the output folder of our project directory.)

  • Boosted Tree: The boosted tree is an ensemble model that combines many weak trees. It builds regression trees iteratively; each new tree is fit to correct the errors of the trees that came before it.
    • mtry: indicates the number of predictors randomly sampled at each split
    • min_n: the number of data points required for node to be split further
    • learn_rate: rate at which algorithm adapts from iteration-to-iteration
  • Elastic Net: This model adds regularization to the ordinary least squares model, accepting some bias in exchange for lower variance. Whereas ridge and lasso regression each use a single fixed type of penalty, the elastic net combines the ridge (L2) and lasso (L1) penalties.
    • penalty: represents the amount of regularization
    • mixture: proportion of L1 (Lasso) regularization
  • K-Nearest Neighbors: The nearest neighbor model averages values of the nearest observations, with similar observations being closer in distance than different observations.
    • neighbors: the number of neighbors taken into account
  • Neural Network: Also known as a multilayer perceptron, this model starts with a set of weights that are iteratively adjusted based on correct or incorrect classifications.
    • hidden_units: number of units in the hidden layer
    • penalty: amount of weight decay
  • Random Forest: The random forest is an ensemble model that combines a set of decision trees. For each decision tree, a random set of input variables are included. Then for the test data, the random forest considers the “vote” from each tree, with the class receiving the highest “vote” serving as the classification.
    • mtry: indicates the number of predictors randomly sampled at each split
    • min_n: the number of data points required for node to be split further
  • Support Vector Machine: This model identifies a hyperplane that maximizes the margin, the distance between it and the nearest data point.
    • cost: cost of predicting a sample within/on wrong side of margin
    • rbf_sigma: precision parameter for radial basis function
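As an illustration of how these tuning scripts are structured, a sketch for the random forest might look roughly like the following. This is an approximation, not the exact script: the folds object name local_arts_folds and the grid ranges are assumptions, while rf_recipe is the recipe defined above.

```r
library(tidymodels)

# model specification with the two parameters flagged for tuning
rf_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# workflow combining the recipe and the model specification
rf_workflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)

# regular grid over the tuning parameters (ranges are illustrative)
rf_grid <- grid_regular(
  mtry(range = c(2L, 20L)),
  min_n(range = c(2L, 40L)),
  levels = 5
)

# tune across the resampling folds
rf_tuned <- rf_workflow %>%
  tune_grid(resamples = local_arts_folds, grid = rf_grid)
```

The other five models follow the same pattern, swapping in the appropriate model specification, recipe, and tuning parameters.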

Evaluating Model Performance

After tuning our six models, we evaluated how each performed across the folds. The metric we decided to use was the area under the receiver operating characteristic curve, or roc-auc. Roc-auc measures how well a model can differentiate between the classes of the outcome variable. A value closer to one means the model performs better, while a value of 0.5 means the model performs no better than if we had simply predicted the outcome values at random.
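For intuition, roc-auc can be computed with the yardstick package on a small two-class example (the data here are invented purely for illustration):

```r
library(yardstick)
library(tibble)

# toy predictions: true class and predicted probability of class "high"
toy_pred <- tibble(
  truth = factor(c("high", "high", "low", "low"), levels = c("high", "low")),
  .pred_high = c(0.9, 0.7, 0.4, 0.2)
)

# the two classes are perfectly separated by probability,
# so the area under the ROC curve is 1
roc_auc(toy_pred, truth = truth, .pred_high)
```

With a multiclass outcome like income, yardstick instead averages over the classes (using the Hand-Till estimator), which is why our results below list hand_till as the estimator.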

Below is a table showing the roc-auc value, and standard error, of the best performing model of each model type across the folds, arranged in descending order by roc-auc.

# create tibble
best_models <- tribble(
  ~`Model Type`, ~`Roc-auc`, ~`Standard error`,
  #----------|---------|----------------|
  "Random Forest", 0.712, 0.00441,
  "Elastic Net", 0.684, 0.00264,
  "Boosted Tree", 0.688, 0.00265,
  "Support Vector Machine", 0.687, 0.00306,
  "Neural Network", 0.591, 0.0128,
  "K-Nearest Neighbors", 0.596, 0.00385
)

# present the table
best_models %>% 
  # arrange in descending order of roc-auc
  arrange(desc(`Roc-auc`)) %>% 
  # apply table styling functions from kable package
  kbl() %>%
  kable_styling()
Model Type Roc-auc Standard error
Random Forest 0.712 0.00441
Boosted Tree 0.688 0.00265
Support Vector Machine 0.687 0.00306
Elastic Net 0.684 0.00264
K-Nearest Neighbors 0.596 0.00385
Neural Network 0.591 0.01280

As you can see, the random forest model performs the best of all the model types, producing an roc-auc value of 0.712. Furthermore, there is no overlap between the standard error of the best random forest and the standard error of the best boosted tree, the second-highest performing model. This gives us strong evidence that the random forest is our best performing model, and thus the model we will use to predict the outcome values on our test data. However, before finalizing our best random forest workflow and applying it to the test data, let's get a sense of the parameter values that created the best random forest model. Below, we read in the tuned random forest object from its .rda file, and then use the autoplot() function to produce a graphical representation of the random forest's performance across different parameter values.

# read in rf_tuned
load(file = "output/models/rf_tuned.rda")

# autoplot
# specify roc_auc as our performance measure
autoplot(rf_tuned, metric = "roc_auc")

The autoplot shows that the random forest consistently produced the best roc-auc value with a minimal node size (min_n) of 40, except for a small overlap with a min_n value of 30 at an mtry value of 21. It also shows that roc-auc increased as the number of predictors randomly sampled at each split (mtry) increased from 10 to about 30, and then dropped off between 30 and 56. Thus, the combination of parameters that produced the best random forest model was a min_n of 40 and an mtry of 33, as you can see in the table below. (Note: all five models appear to perform the same because their roc-auc values are rounded to the thousandth, but the model with an mtry of 33 and a min_n of 40 is at the top.)

# show best performing random forest models
rf_tuned %>% 
  show_best(metric = "roc_auc")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    33    40 roc_auc hand_till  0.712    15 0.00441 Preprocessor1_Model23
## 2    33    21 roc_auc hand_till  0.712    15 0.00455 Preprocessor1_Model13
## 3    44    40 roc_auc hand_till  0.712    15 0.00445 Preprocessor1_Model24
## 4    21    30 roc_auc hand_till  0.712    15 0.00437 Preprocessor1_Model17
## 5    21    40 roc_auc hand_till  0.712    15 0.00430 Preprocessor1_Model22

Finalizing Random Forest Workflow and Variable Importance Plot

Now that we have chosen the random forest model, with a min_n of 40 and an mtry of 33, as our best model, it is time to finalize the random forest workflow with these parameters, and then fit the random forest model to the entire training data. This fitted model is what will be used to predict values of income using the testing data. The code used to do these two steps is printed below.

# set seed
set.seed(1234)

# finalize workflow
rf_final_workflow <- rf_workflow %>% 
  finalize_workflow(select_best(rf_tuned, metric = "roc_auc"))

# fit finalized workflow to all training data
rf_final_results <- fit(rf_final_workflow, local_arts_train)

Before moving on to evaluate the predictive performance of our model on the testing data, we thought it would be beneficial to examine the variable importance plot from the random forest model. This plot describes which variables the model relied on most to predict income, i.e., which variables had the most predictive power.

# create variable importance plot
rf_final_results %>% 
  pull_workflow_fit() %>% 
  vip()

Interestingly, weight, a variable giving the survey weighting factor for each respondent, ended up being the most important variable for predicting income. After that, age, educ, and marital were the second, third, and fourth most predictive variables. This is rather unsurprising, given the correlation we saw between these demographic variables and income in our exploratory data analysis. It is also worth noting that none of the categorical arts participation variables made it into the top 10 most important predictors; only two numeric arts participation variables, nbooks and nfair, did. Of the numeric arts participation variables, it makes sense that nbooks in particular made the top 10, given its distinct relationship with income.

Performance on Test Data

After fitting the model to the entire training dataset with tuned parameters, we reviewed its performance on the testing dataset.

load("output/models/random_forest_predictions.rda")
rf_final_pred %>%
  roc_auc(truth = income, `.pred_Under $10,000`, `.pred_$10,000 to $14,999`, `.pred_$15,000 to $19,999`,
                     `.pred_$20,000 to $29,999`, `.pred_$30,000 to $39,999`, `.pred_$40,000 to $49,999`,
                     `.pred_$50,000 to $74,999`, `.pred_$75,000 to $99,999`, `.pred_$100,000 or more`)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.715

The overall roc-auc value is 0.715, similar to the random forest's performance across the cross-validated folds. This consistency is a good sign and suggests that our model did not overfit the training data. While informative, this roc-auc value is an average across all classes. How did performance vary by class? To investigate this question, we can plot an ROC curve for each class of our outcome variable. An ROC curve graphs sensitivity against one minus specificity at different threshold values. Sensitivity is the number of true positives (correctly classified members of the event class) divided by the number of true positives plus false negatives (all members of the event class). Specificity is the number of true negatives (observations correctly classified as not belonging to the event class) divided by the number of true negatives plus false positives (all members of the non-event class). A threshold is the decision point at which observations are assigned to a class based on the probabilities the model designates for that observation for each class. An ROC curve indicates better performance the closer it comes to a right angle; a curve at or near the straight diagonal line indicates that the model distinguished between the classes no better than random chance.
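To ground these definitions, sensitivity and specificity can be computed by hand from a small hypothetical confusion table (the counts below are invented for illustration):

```r
# hypothetical counts for one class at one threshold
tp <- 40  # true positives: event class, predicted as event
fn <- 10  # false negatives: event class, predicted as non-event
tn <- 30  # true negatives: non-event class, predicted as non-event
fp <- 20  # false positives: non-event class, predicted as event

sensitivity <- tp / (tp + fn)  # 40 / 50 = 0.8
specificity <- tn / (tn + fp)  # 30 / 50 = 0.6

# this threshold contributes the point (1 - 0.6, 0.8) = (0.4, 0.8)
# to the ROC curve; sweeping the threshold traces out the full curve
c(sensitivity = sensitivity, specificity = specificity)
```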

rf_final_pred %>%
  roc_curve(truth = income, `.pred_Under $10,000`, `.pred_$10,000 to $14,999`, `.pred_$15,000 to $19,999`,
                     `.pred_$20,000 to $29,999`, `.pred_$30,000 to $39,999`, `.pred_$40,000 to $49,999`,
                     `.pred_$50,000 to $74,999`, `.pred_$75,000 to $99,999`, `.pred_$100,000 or more`) %>%
  autoplot()

The ROC curves indicate that the random forest predicts the Under $10,000 and $50,000 to $74,999 income brackets well. For the rest of the income levels, however, performance is considerably worse. For the income brackets starting at $20,000, $30,000, and $40,000, the curve closely hugs the y = x line, indicating that the model does not accurately distinguish the correct class from the incorrect ones.

We can more closely examine the performance of our random forest model by reviewing a confusion matrix.

conf_mat(rf_final_pred, truth = income, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

The predicted classifications are on the y-axis while the true values are on the x-axis. The x-axis starts at Under $10,000 and goes to $100,000 or more. The diagonal indicates where the two meet; in other words, where the predicted value matches the true value.

It seems that the random forest model accurately predicts many of the values in the Under $10,000, $20,000 to $29,999, and $50,000 to $74,999 classes (visible in how these boxes are more darkly shaded along the "accuracy" diagonal). For the rest of the classes, the random forest does not do a very good job of correctly classifying the values. In fact, the random forest classifies the majority of its input as $20,000 to $29,999, $30,000 to $39,999, or $50,000 to $74,999, as you can see from how full and darkly shaded these rows are. In the context of our EDA, this makes sense: $20,000 to $29,999, $30,000 to $39,999, and $50,000 to $74,999 were the classes with the highest proportion of observations in the dataset. Although we stratified both our initial split and our folds by income, it may be that the data lacked strong distinguishing variables. That is, none of the variables may have been good predictors of income level, causing the random forest to essentially default to the most common classes.

Conclusion

The random forest performed decently on the training and testing data. Given its empirical nature and affinity for high-dimensional data, this ensemble model performed the best of the six models we tested. We would be hesitant to endorse its use beyond this project, however, given that it only provides accurate predictions for certain classes. Arts attendance is almost certainly not a great predictor of income regardless, given that participation tends to be voluntary and self-selective, fluctuating with age, lifestyle, and responsibilities.

Next Steps

To provide better predictions, it could be valuable to re-examine the original training dataset with greater domain knowledge, in order to identify variables we may have discarded that could actually be useful. Another way to improve performance would be to review the correlation plot and include only variables strongly correlated with the outcome variable. That way, inferential statistics could be applied to improve upon our predictive analysis. Future questions to explore include:

  • How can the age of the participant be predicted by other variables within the dataset?
  • Is participation in various arts activities more indicative or predictive of a respondent’s marital status or of their age than of the respondent’s income?

Data Citation

National Endowment for the Arts. Local Area Arts Participation Study 1992. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-03-23. https://doi.org/10.3886/ICPSR35586.v1

Link to NADAC (National Archive of Data on Arts and Culture) site from which data was downloaded: https://www.icpsr.umich.edu/web/NADAC/studies/35586